This first section is about cleaning and preparing the data. It’s not very interesting and isn’t very nice to look at, so it’s hidden from this report. However, feel free to view the complete code to see the full cleaning steps and take a look at the notes I left about why I made the corrections I did. But for now, let’s get straight to the fun part!
To find the best markets, we’ll examine a few different factors. First, let’s see where agents are converting leads into the most sales.
Based on closing rates alone, it looks like the best markets are Dallas, Louisville, Raleigh, Nashville, and Cincinnati. But there’s more to the story than just closing rates. Lots of factors can affect a market’s performance – let’s keep digging!
Other important factors to consider are how long a property stays on the market (days on market, or DoM), the average closing price, the turnover rate, and how often a listing price increases.
When we take all those metrics into account, the best markets seem to be New York, Los Angeles, Chicago, Dallas, Houston, and Washington, D.C.
| Market | Population | Customers | Median DoM | Median Price | Price Increase Rate | Avg Buyer Close Rate | Avg Seller Close Rate |
|---|---|---|---|---|---|---|---|
| new_york_ny | 19,756,722 | 967 | 58 | $785,000 | 1.0% | 16.2% | 19.7% |
| los_angeles_ca | 13,012,469 | 1,012 | 55 | $915,000 | 1.8% | 16.3% | 17.9% |
| chicago_il | 9,359,555 | 982 | 52 | $360,000 | 1.0% | 18.0% | 19.6% |
| dallas_tx | 7,807,555 | 1,029 | 64 | $408,386 | 1.5% | 17.8% | 24.0% |
| houston_tx | 7,274,714 | 993 | 60 | $329,900 | 3.8% | 16.6% | 20.0% |
But wait! Those are also all large population centers. Let’s check where the best markets are after we control for population: we can standardize each metric and assign each market an overall performance score, independent of population.
| Rank | Score | Market | Population | Customers | Median DoM | Median Price | Price Increase Rate | Avg Buyer Close Rate | Avg Seller Close Rate |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.52 | san_jose_ca | 1,969,353 | 958 | 17 | $1,632,174 | 3.9% | 14.9% | 19.7% |
| 2 | 5.60 | buffalo_ny | 1,161,385 | 1,003 | 12 | $290,000 | 1.6% | 18.0% | 17.2% |
| 3 | 5.56 | grand_rapids_mi | 1,154,320 | 1,026 | 14 | $355,500 | 0.4% | 16.6% | 21.7% |
| 4 | 4.91 | san_diego_ca | 3,282,782 | 998 | 38 | $899,450 | 2.4% | 17.9% | 19.4% |
| 5 | 4.34 | san_francisco_ca | 4,653,593 | 1,045 | 21 | $1,482,500 | 2.0% | 15.7% | 19.8% |
California (and New York, to a lesser extent) may still be overrepresented here. Despite normalization, the previous scoring was still weighted heavily toward the wealthiest and most densely populated states. So where is Smiley Real Estate most efficient, independent of population and wealth? The following analysis weights each factor by importance: closing rates and customer saturation are weighted more heavily, while median closing price is weighted lower to favor efficiency over property value or population.
| Rank | Score | Market | Population | Customers | Median DoM | Median Price | Price Increase Rate | Avg Buyer Close Rate | Avg Seller Close Rate |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 6.99 | grand_rapids_mi | 1,154,320 | 1,026 | 14 | $355,500 | 0.4% | 16.6% | 21.7% |
| 2 | 5.75 | raleigh_nc | 1,449,594 | 1,018 | 48 | $437,500 | 3.1% | 20.2% | 20.5% |
| 3 | 5.53 | buffalo_ny | 1,161,385 | 1,003 | 12 | $290,000 | 1.6% | 18.0% | 17.2% |
| 4 | 5.23 | san_jose_ca | 1,969,353 | 958 | 17 | $1,632,174 | 3.9% | 14.9% | 19.7% |
| 5 | 5.16 | louisville_ky | 1,361,847 | 987 | 44 | $280,000 | 0.8% | 18.4% | 22.8% |
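The weighted scoring above can be sketched in a few lines of R. This is a simplified illustration, not the actual scoring code: the weights and the three-market subset are hypothetical, and metrics where lower values are better (days on market, median price) are negated before weighting.

```r
# Illustrative sketch of the weighted efficiency score (weights are hypothetical)
markets <- data.frame(
  market       = c("grand_rapids_mi", "raleigh_nc", "buffalo_ny"),
  median_dom   = c(14, 48, 12),
  median_price = c(355500, 437500, 290000),
  buyer_rate   = c(0.166, 0.202, 0.180),
  seller_rate  = c(0.217, 0.205, 0.172)
)

# Standardize each metric, flipping the sign where lower values are better
z <- data.frame(
  dom    = -scale(markets$median_dom),
  price  = -scale(markets$median_price),
  buyer  =  scale(markets$buyer_rate),
  seller =  scale(markets$seller_rate)
)

# Hypothetical weights favoring efficiency: close rates high, price low
weights <- c(dom = 1.0, price = 0.5, buyer = 1.5, seller = 1.5)
markets$score <- drop(as.matrix(z) %*% weights)
```

Even with these toy weights, Grand Rapids comes out on top of this three-market subset, which mirrors the ranking in the table above.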
After accounting for population and high property values, the top five markets are Grand Rapids, Raleigh, Buffalo, San Jose, and Louisville. These markets strike the best balance among liquidity (properties don’t spend long on the market), listing price (with the exception of San Jose, they present a low barrier to entry for buyers), listing price increase rates, and agent closing rates, and thus represent the best opportunity to maximize return on investment (ROI).
Markets with the highest turnover are the least likely to face spiking listing prices. This suggests that the highest volume markets are highly mature and efficient, so buyers and sellers face less volatility in these markets.
In contrast, low turnover markets have the highest price volatility. These markets may be experiencing “boom” conditions due to a sudden lack of supply or short-term enthusiasm, making them riskier. An investor is most likely to find consistent, steady returns – rather than speculative gains – in the high-turnover, low-volatility market.
One notable exception is Portland, OR. Despite having a moderate turnover rate, Portland exhibits an anomalously high rate of listing price increases (noted as an outlier in the cleaning section and as seen on the chart). This extreme volatility, in contrast to the typical market maturity trend, suggests that Portland’s price appreciation may be driven by external, non-liquidity factors (such as regulatory constraints or unique geographic supply issues). For an investor, this market represents high potential reward but with significantly less predictability than other high-turnover markets.
What makes an agent the best is just as nuanced as what makes a market the best. Let’s look at some of the key factors! To determine the best agent in each market, we’ll look at their number of ratings, their average rating, how many deals they’ve closed, and the percentage of leads they’ve converted to sales.
| Rank | Market | Agent | Score | Weighted Rating | Total Closes | Buyer Rate | Seller Rate | Total Reviews |
|---|---|---|---|---|---|---|---|---|
| 1 | boston_ma | mary_baker_972 | 9.94 | 18.67 | 76 | 11.4% | 40.0% | 14 |
| 2 | riverside_ca | carlos_young_28 | 8.84 | 11.30 | 74 | 30.3% | 43.6% | 7 |
| 3 | raleigh_nc | raj_hill_659 | 8.40 | 14.52 | 46 | 25.0% | 41.0% | 9 |
| 4 | dallas_tx | emily_adams_50 | 8.33 | 12.87 | 55 | 29.3% | 37.9% | 11 |
| 5 | los_angeles_ca | camila_robinson_734 | 8.25 | 13.43 | 72 | 30.1% | 22.9% | 14 |
| 6 | indianapolis_in | isabella_johnson_130 | 7.40 | 16.11 | 74 | 16.6% | 23.0% | 12 |
| 7 | jacksonville_fl | luis_perez_521 | 7.26 | 11.28 | 73 | 22.3% | 40.7% | 11 |
| 8 | philadelphia_pa | jin_campbell_783 | 7.21 | 13.12 | 70 | 14.8% | 44.1% | 10 |
| 9 | miami_fl | lakshmi_mohamed_440 | 7.06 | 16.04 | 71 | 14.1% | 25.5% | 12 |
| 10 | virginia_beach_va | minh_mohamed_224 | 6.94 | 14.79 | 60 | 28.6% | 11.8% | 12 |
The table above highlights the top 10 agents overall. The full ranked list of the best agent in every market analyzed can be downloaded here.
So what makes these agents the best?
Though the number of customer reviews was weighted down in this analysis, it’s clear that customer engagement matters! All but one of the top agents had more customer reviews than average. Agents overall average 4.6 customer reviews, while the top agents averaged 8.9 – almost double the customer engagement rate!
The best agents are also able to maintain a steady volume of customers. Volume isn’t the main driver, but more (engaged!) customers mean more sales! All but two of the top agents closed as many or more sales than average. On average, top agents closed 50% more sales than the average agent!
Over half of top agents convert more leads into sales than average. The best agents are able to close 25-35% more deals than average!
The bottom line? The best agents are the ones who can effectively engage with a high volume of customers without sacrificing quality. More customers means more opportunities, but only if you can keep them happy!
It’s tempting to want to stick with what you know – lots of people and an expensive market means lots of opportunity for high-commission sales! However, there’s opportunity beyond wealthy, densely populated markets like New York City and Los Angeles.
As discussed above, New York City and Los Angeles are two of the most successful markets right now. However, highest value doesn’t always mean most efficient. Investing more resources in high-efficiency markets, like Grand Rapids, MI or Raleigh, NC, could lead to higher turnover, the opportunity for more – if smaller – sales, and the chance to connect with more people by being able to close more deals in a shorter amount of time.
As a case study, let’s compare New York City (a high value market) to Grand Rapids (a highly efficient market).
New York requires 121% more capital per transaction than Grand Rapids, significantly hindering liquidity and sales velocity. A property in New York also sits about four times longer on the market, on average, than a property in Grand Rapids. This means a salesperson could theoretically sell four properties in Grand Rapids with the same time and energy it takes to sell just one property in New York – based on median sale prices at close, that’s a gross transaction value difference of $637,000 in Grand Rapids’ favor! With a lower barrier to entry and higher turnover, Grand Rapids has 88% higher market activity than New York. Investing more resources into smaller, highly efficient markets like Grand Rapids could lead to a higher ROI than focusing exclusively on large markets like New York.
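These figures follow directly from the median prices and days on market reported in the earlier tables; a quick back-of-the-envelope check in R (the 88% market-activity figure comes from turnover data not shown in this excerpt):

```r
# Median close prices and days on market, taken from the market tables above
ny_price <- 785000   # new_york_ny median close price
gr_price <- 355500   # grand_rapids_mi median close price
ny_dom   <- 58       # new_york_ny median days on market
gr_dom   <- 14       # grand_rapids_mi median days on market

# Capital required: New York needs ~121% more per transaction
capital_premium <- (ny_price / gr_price - 1) * 100   # ~120.8%

# Velocity: a New York listing sits ~4x longer than one in Grand Rapids
dom_ratio <- ny_dom / gr_dom                          # ~4.1

# Gross transaction value: four Grand Rapids sales vs. one New York sale
gtv_difference <- 4 * gr_price - ny_price             # $637,000
```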
The analysis could be improved by moving beyond static rate comparisons into predictive modeling and geospatial segmentation. Given more time and resources, there are three primary ways I would improve the analysis above.
Instead of only comparing current averages, I would build a model to predict the future success (e.g. close rate, tenure, revenue) of a new agent entering a specific market. This would separate agent quality from market quality. To build a successful model, I would need agent profile data covering historical agent demographics, training scores, prior industry experience, and agent-specific lead sources. I would also use lead quality metrics to tag leads by source (e.g., paid digital, referral, organic search) and initial qualification score. Lastly, I would use the model’s output to create a propensity score for each market and agent profile: the predicted probability of an agent achieving a top-quartile close rate within 12 months.
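As a rough illustration of the idea, a propensity model like this could be a logistic regression on agent profile features. Everything here is hypothetical: the column names (`training_score`, `prior_experience`, `lead_source`) don’t exist in the current dataset, and the outcome is simulated purely for demonstration.

```r
# Hypothetical sketch: probability that a new agent reaches a top-quartile
# close rate within 12 months. All fields and data are simulated.
set.seed(42)
agents <- data.frame(
  training_score   = runif(200, 0, 100),
  prior_experience = rbinom(200, 1, 0.4),
  lead_source      = sample(c("paid", "referral", "organic"), 200, replace = TRUE)
)

# Simulated outcome for demonstration only
agents$top_quartile <- rbinom(200, 1, plogis(-2 + 0.03 * agents$training_score))

model <- glm(top_quartile ~ training_score + prior_experience + lead_source,
             data = agents, family = binomial)

# Propensity score: predicted probability for each agent profile
agents$propensity <- predict(model, type = "response")
```

In practice the same `predict()` call could score a prospective agent/market pairing before committing resources to it.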
The current analysis treats each city as a monolithic market. In reality, a city like New York is composed of highly distinct micro-markets (e.g. Manhattan vs. Brooklyn vs. Long Island). With more time I could use geospatial boundary data to create detailed maps of county, zip code, or neighborhood service areas for each transaction. I could then also add more specific local economic indicators like median income, population density, average home sale price, and inventory days on market to those micro-markets. With these data, I could dive much deeper than the current analysis. For example, I could visualize data like the top ten highest-ROI zip codes, which would enable us to find high efficiency zip codes and focus sales to maximize local ROI.
The current metrics are static averages taken from a snapshot of data. To understand how agent performance, customer saturation, and growth rates fluctuate over time, I could use historical data to track transaction volumes, lead flow, agent performance, etc., and analyze how those metrics have changed over the years. This would provide the necessary context to distinguish a temporary spike from a long-term trend. Having time-stamped external factors like local interest rate data, state or local regulatory changes, or major economic events would also help make the analysis more robust; for example, these data could help explain the outliers in listing price decreases in Columbus and Indianapolis (for more details, see the outliers section in the cleaning portion of the complete code). A time-series analysis could also surface trends in seasonality or other peak performance windows.
The overall process followed the standard data analysis workflow: data cleaning and preparation, analytical modeling, and communication. The entire project was executed primarily using R, leveraging packages like tidyverse for data manipulation and ggplot2, plotly, and kableExtra for visualization and reporting.
If the data isn’t clean, all of my analyses could give me wildly different results than expected. When I clean data, I make sure to address a few key areas: data type consistency, missingness, potential spelling errors or typos that could lead to aggregation errors, logical tests (e.g. dates shouldn’t be in the future), and outliers. One challenge in data cleaning is deciding how to handle data that doesn’t pass my checks. Sometimes data can be imputed (e.g. a rating score can be imputed with the median for a dataset like this one with low missingness because it won’t introduce skew). Sometimes missing values can be safely deleted. Sometimes outliers are an obvious data entry error, but without access to the source data, that can be hard to verify. Ultimately, I have to use my critical thinking skills to make the best decision for each dataset to maintain data integrity.
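Those checks can be sketched as a small helper function. The column names (`close_date`, `rating`, `price`) are assumptions for illustration, not the actual schema:

```r
# Illustrative cleaning checks: a logical date test, median imputation for a
# low-missingness column, and outlier flagging (flag for review, don't delete)
check_and_clean <- function(df) {
  # Logical test: closing dates should not be in the future
  stopifnot(all(df$close_date <= Sys.Date(), na.rm = TRUE))

  # Median imputation for a numeric column with low missingness
  df$rating[is.na(df$rating)] <- median(df$rating, na.rm = TRUE)

  # Flag values beyond 1.5 * IQR for manual review rather than deleting them
  bounds <- quantile(df$price, c(0.25, 0.75), na.rm = TRUE)
  iqr <- diff(bounds)
  df$price_outlier <- df$price < bounds[1] - 1.5 * iqr |
                      df$price > bounds[2] + 1.5 * iqr
  df
}
```

Keeping outliers flagged (rather than dropped) preserves the option to verify them against source data later, which matches the approach described above.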
After the data is clean, it’s on to the fun part! To perform the analyses, I think critically about what is being asked and the best metric to answer those questions. For this project I primarily used R, but I have also used Python, Excel, SQL, Tableau, and Infogram in other projects.
I started by creating derived metrics like avg_buyer_close_rate and customer_saturation_rate by grouping the raw agents and customers data by market_id. This established the foundation for subsequent comparisons.
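The grouping step looks roughly like this. The actual analysis used dplyr’s `group_by()`/`summarise()`; this dependency-free base-R sketch uses simplified, assumed column names:

```r
# Toy agents data with assumed column names (simplified from the real schema)
agents <- data.frame(
  market_id    = c("a", "a", "b", "b"),
  buyer_leads  = c(50, 40, 30, 60),
  buyer_closes = c(10, 6, 9, 12)
)

# avg_buyer_close_rate: mean of each agent's close rate, grouped by market
agents$rate <- agents$buyer_closes / agents$buyer_leads
by_market <- aggregate(rate ~ market_id, data = agents, FUN = mean)
names(by_market)[2] <- "avg_buyer_close_rate"
```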
To compare markets and agents fairly, I used the scale() function in R to convert raw metrics (like median price, close rates, etc.) into Z-scores. This process, called standardization (or Z-score normalization), controlled for the sheer size and wealth of major cities (like New York or Los Angeles), ensuring the resulting scores reflected performance relative to the average, not just the absolute numbers.
I then created the final Composite Performance Scores for both markets and agents by assigning weights to these Z-scores. For example, in the Efficiency Market Analysis, I intentionally gave lower weight to the Z-score for price_closed_median and higher weight to close rates to favor operational efficiency over market wealth. For the Agent Analysis, I gave the highest weight to the weighted_rating to favor agents with proven customer satisfaction.
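For the agent side, the composite score works the same way: standardize each metric with `scale()`, then take a weighted sum. The three-agent subset and the weights below are illustrative only; in the sketch, `weighted_rating` gets the highest weight, as described above.

```r
# Illustrative agent composite score (real weights and metrics differ)
agents <- data.frame(
  agent           = c("mary_baker_972", "carlos_young_28", "raj_hill_659"),
  weighted_rating = c(18.67, 11.30, 14.52),
  total_closes    = c(76, 74, 46),
  buyer_rate      = c(0.114, 0.303, 0.250)
)

# Standardize each numeric metric into a Z-score matrix
z <- sapply(agents[, -1], scale)

# weighted_rating gets the highest weight to favor proven customer satisfaction
w <- c(weighted_rating = 2.0, total_closes = 1.0, buyer_rate = 1.0)
agents$score <- drop(z %*% w)
```

With these toy weights, mary_baker_972 ranks first in this subset, consistent with the agent table above.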
Finally, I used the ggplot2, plotly, and kableExtra packages to create the visualizations and clean tables, ensuring that the highest performers were easily identifiable.
One of the most significant challenges was the inherent multicollinearity and size bias in the raw data. My first attempt at ranking the markets put the largest cities (like New York and Los Angeles) at the top. That analysis was workable, but it seemed shallow and suggested that population size had not been adequately controlled for, despite earlier normalization attempts. This confirmed that raw size was overpowering the other metrics.
I addressed the challenge by enforcing the standardization step and then adjusting the weights (e.g., reducing the weight of volume metrics and increasing the weight of efficiency metrics) to successfully decouple operational success from raw market size.
Analyses don’t mean much if no one can understand them. After the analyses, I clean up the report. I make sure the report itself is formatted and easy to read. I put the finishing touches on all the visualizations to make sure the data I want to stand out really pops. Graphic design and intentional formatting can go a long way to enhancing communication, especially for subjects that can otherwise be dry or tricky to understand.
Throughout the whole process, I make sure my work is saved and backed up. I use GitHub to version control my work and to provide a secondary backup.